Building an Evaluation Scale using Item Response Theory

Authors

  • John P. Lalor
  • Hao Wu
  • Hong Yu
Abstract

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy, precision, recall, F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items (their difficulty and discriminating power) and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
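
The closing claim of the abstract, that equal accuracy need not mean equal IRT score, can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) model, which may not be the model the authors fit; the abstract does not say. The item parameters, the two response patterns, and the function names are all hypothetical, and ability is estimated by a simple grid search rather than by any particular IRT package.

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve: probability of a correct response
    at ability theta, for discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate over an evenly spaced grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination a, difficulty b). In practice these
# would be fitted from a large set of human responses, not chosen by hand.
items = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical systems, each answering 2 of 4 items correctly.
system_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
system_b = [0, 0, 1, 1]  # correct only on the strongly discriminating items

theta_a = estimate_ability(items, system_a)
theta_b = estimate_ability(items, system_b)

print("accuracy A:", sum(system_a) / len(items), "ability A:", round(theta_a, 2))
print("accuracy B:", sum(system_b) / len(items), "ability B:", round(theta_b, 2))
```

Both systems have the same accuracy, but under these assumed parameters the second receives the higher ability estimate, because its correct answers fall on the items that carry more information about ability.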

Similar resources

Psychometric Properties of State Level Subjective Vitality Scale based on classical test theory and Item-response theory

The purpose of the present study was to investigate the factor structure and item-response parameters of the State Level of Subjective Vitality Scale. The research design was correlational, and the statistical population consisted of students of the Shahid Beheshti University of Tehran. A sample of 240 students was selected through multi-stage sampling and completed Subjective Vitality ...

Psychometric Properties of the Brief Form of Professor-Students Rapport Scale-based on Classical Test Theory and Item-Response Theory

Introduction: In order to improve the quality of the teaching process, it is necessary to review the professor-student rapport. The purpose of the present study was to investigate the factor structure and item-response parameters of Professor-Students Rapport Scale-Brief (PSRS-B). Methods: In a descriptive-correlation study, 497 students from Shahid Beheshti University of Medical Sciences were ...

Evaluation Psychometric Characteristics of the Persian Version of the Colorado Learning Attitudes about Science Survey Using polytomous Item Response Model

Goal: Researchers in the field of science education believe that people's attitudes about learning will have a significant impact on their future learning and what they learn from science will not be unrelated to their views and attitudes. Accordingly, most questionnaires have been developed to measure attitudes toward science, especially about physics learning attitudes...

Evaluation of TLD Performance in Reducing the Seismic Response of Structures

Tuned Liquid Dampers (TLD) are among passive control devices that have been used to suppress the vibration of structures in recent years. These structures must be adequately presentable as an equivalent single degree of freedom system with long fundamental period. The TLD, located at the top floors of the structure, can dissipate the external input energy into the system through the sloshing ef...

Journal:
  • Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing

Volume 2016

Publication date 2016